A method and apparatus for retrieving relevant documents from a corpus of documents



A method and apparatus accesses relevant documents based on a query (230). A thesaurus
of word vectors (242) is formed for the words in the corpus of documents (240). The word
vectors represent global lexical co-occurrence patterns and relationships between word
neighbors. Document vectors (246), which are formed from the combination of word vectors,
are in the same multi-dimensional space as the word vectors. A singular value decomposition
is used to reduce the dimensionality of the document vectors. A query vector (232) is formed
from the combination of word vectors associated with the words in the query. The query
vector and document vectors are compared to determine the relevant documents. The query
vector can be divided into several factor clusters to form factor vectors. The factor vectors
are then compared to the document vectors to determine the ranking (252) of the documents
within the factor cluster.



[FIG. 12: flow diagram. Input query (230); compute query context vector (232); store query
context vector; input documents; compute document context vectors; store document context
vectors; compute correlation coefficient for each document (250); rank documents (252);
store ranking of documents.]



EP 0 687 987 A1 



This invention relates to improvements in retrieving relevant documents from a corpus of documents. More
particularly, this invention determines the co-occurrence patterns of words in a document to form a thesaurus
of word vectors.

Information retrieval systems typically define similarity between queries and documents in terms of a
weighted sum of matching words. The usual approach is to represent documents and queries as long vectors
and use similarity search techniques. One problem is that a query and a document may share ambiguous words
that are used in a different sense. Therefore, a false similarity match could occur (ambiguity problem). A second
problem is that the same content may be expressed in different words. Therefore, a short query may miss a
relevant document (synonymy problem). If a document is relevant but uses words synonymous to words in the
query, the document cannot be found.

One solution is to lengthen the query through relevance feedback. After the first search is conducted using
an initial query, additional words are added to the query to narrow the search for the next search iteration.

Another solution is to expand a query through synonym relations as found in a thesaurus. A synonym relation
is one that finds "closely related words" as opposed to "syntactically and semantically interchangeable words".
Thus, the procedure simply maps from one word to other closely related words.

For a thesaurus to be useful in information retrieval, it must be specific enough to offer synonyms for words
as used in the corpus of interest. For example, in a corpus of computer science documents, the word
"interpreter" would have meanings quite different from everyday language. A thesaurus must also cover all or most
of the words found in queries, including the potentially unbounded set of proper nouns.

These two considerations suggest that generic thesauri, which are restricted to common usage, are
unlikely to be helpful. Thus, the thesauri must be tuned to the corpus of interest. These might be hand-built for
a restricted domain or computed from the text of the corpus itself.

A thesaurus is a data structure that defines semantic relatedness between words. It is typically used in
information retrieval to expand search terms with other closely related words. Even if the thesaurus is not
explicitly computed, the mapping performed by query expansion explicitly defines a thesaurus.

The simplest, and perhaps most conventional, approach to thesaurus construction is to manually build
an explicit semantic mapping table. This is clearly labor-intensive, and hence only possible in specialized
domains where repeated use may justify the cost. For example, the RUBRIC and TOPIC text retrieval systems
require a domain expert to prepare a hierarchical structure of "topics" germane to a particular subject area.
See McCune et al.; "Rubric, A System for Rule-based Information Retrieval"; IEEE Transactions on Software
Engineering 9; pp. 939-44; 1985. Each topic is a boolean combination of other topics and search terms.
Searchers then employ terms from this hierarchy to form queries that automatically expand to complex boolean
expressions.

Another approach is to reuse existing on-line lexicographic databases, such as WordNet (see Voorhees
et al.; "Vector Expansion in a Large Collection"; Proceedings of TREC; 1992) or Longman's subject codes (see
Liddy et al.; "Statistically-guided Word Sense Disambiguation"; Working Notes of the AAAI Fall Symposium
on Probabilistic Approaches to Natural Language; 1992; AAAI Press). However, generic thesauri of this sort
will often not be specific enough for the text collection at hand. For example, in Voorhees et al., "acts" is
expanded with the meaning "acts of the apostles" in a corpus of legal documents. In addition, they frequently
do not record information about proper nouns, yet proper nouns are often excellent retrieval cues.

Corpus-based methods perform a computation on the text of the documents in the corpus to produce a
thesaurus. For example, a hierarchical thesaurus is formed from a computer list of complex noun phrases
where subsumption roughly corresponds to the subset relation defined on terms, e.g., "intelligence" subsumes
"artificial intelligence". See Evans et al.; "Automatic Indexing Using Selective NLP and First-order Thesauri";
Proceedings of the RIAO; Vol. 2, pp. 624-43; 1991. While this method is superior to approaches that treat
phrase terms as unanalyzed segments, there is no notion of semantic similarity of basic terms. For example,
the semantic similarity of "astronaut" and "cosmonaut" is not represented in the hierarchy.

Head-modifier relationships can be used to determine semantic closeness. See Grefenstette, G.; "Use of
Syntactic Context to Produce Term Association Lists for Text Retrieval"; Proceedings of SIGIR 1992; pp. 89-97.
See Ruge, G.; "Experiments on Linguistically-based Term Associations"; Information Processing &
Management 26(3); pp. 317-32; 1992. This solution is costly since parsing technology is required to determine
head-modifier relations in sentences. It is also unclear to what extent words with similar heads or modifiers
are good candidates for expansion. For example, adjectives referring to countries have similar heads ("the
Japanese/Chilean capital", "the Japanese/Chilean government"), but adding "Japanese" to a query that contains
"Chilean" will rarely produce good results.

Semantic relatedness can be determined by considering the occurrence of terms in documents. See
Crouch, C.; "An Approach to the Automatic Construction of Global Thesauri"; Information Processing &
Management 26(5); pp. 629-40; 1990. Documents are clustered into small groups based on a similarity measure.
Two documents are considered similar if they share a significant number of terms, with medium-frequency
terms preferentially weighted. Terms are then grouped by their occurrence in these document clusters. Since
a complete-link document clustering is performed, the procedure is very compute-intensive and does not scale
to a large reference corpus. Further, the central assumption that terms are related if they often occur in the
same documents seems problematic for corpora with long documents. It also does not capture the intuitive
notion that synonyms do not co-occur, but rather have similar co-occurrence patterns.

Crouch constructs thesaurus classes by grouping words into bins of related words. Unfortunately, the
boundaries between classes will inevitably be somewhat artificial. If classes are made too small, some words
will be cut off from part of their topical neighborhood. If classes are too large, words will be forced into classes
with words from different topics. Any particular class size will either separate some words from close neighbors
or lump together some words with distant terms.

A thesaurus can be constructed by defining a similarity measure on terms within the document. See Qiu
et al.; "Concept Based Query Expansion"; Proceedings of SIGIR 1993. Terms are represented as
high-dimensional vectors with a component for each document in the corpus. The value of each component is a
function of the frequency the term has in that document. They show that query expansion using the cosine
similarity measure on these vectors improves retrieval performance. However, the time complexity for
computing the similarity between terms is related to the size of the corpus because the term vectors are
high-dimensional. Thus, the method cannot be used on a large scale.

A large-scale singular value decomposition can be used for information retrieval. See Deerwester et al.;
"Indexing by Latent Semantic Analysis"; Journal of the American Society of Information Science 41(6); pp.
391-407; 1990. An improved retrieval performance results by inducing representations for documents that
reflect term dependencies and remedy the bumpiness of small counts. However, undetected term dependencies
and small counts are a problem if document occurrence is the basis of representation.

The Deerwester method is similar to context vectors, because the LSI (latent semantic indexing) algorithm
can compute document and query vectors as weighted sums of term vectors. However, LSI is based on
document occurrence. Decomposing a term-by-document matrix of a large collection can take days or even weeks
because the time complexity is quadratic in the number of documents to process.

Documents can be represented as vectors whose entries correspond to microfeatures such as finance,
animal kingdom, etc. See Gallant, S. I.; "A Practical Approach for Representing Context and for Performing
Word Sense Disambiguation Using Neural Networks"; Neural Computation 3(3); pp. 293-309; 1991. Word
vectors are manually encoded for a medium number of words. Then, the document vectors are computed as sums
of word vectors. However, this hand-encoding of documents is labor-intensive. Furthermore, the microfeatures
chosen or the manual encoding may not accurately reflect the specifics of the document collection.

Machine-readable dictionaries can be used to derive "context vectors." See Wilks et al.; "Providing Machine
Tractable Dictionary Tools"; Machine Translation; Vol. 5, No. 2, pp. 99-154; 1990. The representation of a word
is based on words it co-occurs with in a dictionary rather than a corpus. Context vectors are then computed
from these dictionary-based word representations. This method has the same problems as other
dictionary-based approaches. In particular, a genre-specific distinction that is not covered in the dictionary is not
adequately represented in the dictionary-based representations.

Some researchers argue against the utility of co-occurrence-based expansion. See Peat et al.; "The
Limitations of Term Co-occurrence Data for Query Expansion in Document Retrieval Systems"; Journal of the
American Society for Information Science 42(5); pp. 378-83; 1991. Because synonyms often do not occur
together, they assert that a co-occurrence-based approach may have difficulty identifying synonymy relations.
Furthermore, many researchers use measures for defining closeness that will group words according to
frequency. By using these measures, it is impossible for a frequent word to have an infrequent neighbor.

An object of the invention is to form a new corpus-based method for constructing a thesaurus based on
lexical co-occurrence of terms in the corpus.

Another object of the invention is to provide a method for accessing and browsing documents based on content
similarity. Words and documents are represented as vectors in the same multi-dimensional space that is
derived from global lexical co-occurrence patterns. The method provides improved retrieval performance for
non-literal matches with queries.

The computation of the lexical co-occurrence thesaurus proceeds in two phases. First, the lexical
co-occurrence pattern of each word is presented as a multidimensional vector, the thesaurus vector. Second, a
similarity measure is induced on words by comparing these vectors. Given a particular word, its synonyms are then
defined to be its nearest neighbors with respect to the similarity measure. This method of exploiting a lexical
co-occurrence structure of words, i.e., forming a word's vector representation from entries of its near lexical
neighbors rather than from only itself, is superior to conventional methods.

Lexical co-occurrence is more informative both qualitatively and quantitatively. Two terms lexically co-occur
if they appear in text within some distance of each other, i.e., a window of k words. Qualitatively, the fact that
two words often occur close to each other is more likely to be significant than the fact that they occur in the
same documents. Quantitatively, there are more co-occurrence events than occurrence-in-document events
in a given document collection. For a word occurring n times in the document collection and for a definition of
co-occurrence as occurring in a window of k words, there are nk co-occurrence events. However, there are only
n occurrence-in-document events. If the goal is to capture information about specific words, lexical
co-occurrence is the preferred basis for statistical thesaurus construction.
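
The counting argument can be illustrated with a minimal sketch. The function name `cooccurrence_events`, the toy sentence, and the one-sided window (the k tokens following each occurrence) are assumptions made for illustration; the text does not prescribe them. Near the end of a text the window is cut short, so n·k acts as an upper bound in this sketch.

```python
from collections import Counter

def cooccurrence_events(tokens, k):
    """Count (word, neighbor) events where the neighbor is one of the
    k tokens following the word (one-sided window; an assumption)."""
    events = Counter()
    for i, word in enumerate(tokens):
        for neighbor in tokens[i + 1 : i + 1 + k]:
            events[(word, neighbor)] += 1
    return events

tokens = "the judge opened the court proceedings and the judge left".split()
k = 3
events = cooccurrence_events(tokens, k)

n = tokens.count("judge")  # occurrences of "judge" in the text
judge_events = sum(count for (word, _), count in events.items() if word == "judge")
print(n, judge_events, n * k)  # prints: 2 4 6
```

Even in this tiny text, the word "judge" yields more co-occurrence events than occurrences, which is the quantitative point made above.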

Synonyms rarely co-occur. Synonyms tend to share neighbors that occur with both. For example,
"litigation" and "lawsuit" share neighbors such as "court", "judge", and "proceedings". A thesaurus represented by
lexical co-occurrence patterns defines semantic closeness in terms of common neighbors. Thus, synonyms are
not required to co-occur, but they must have similar co-occurrence patterns.

A multi-dimensional continuous space is formed where each word's thesaurus vector represents its
individual position. A continuous space does not force a classification choice, and hence avoids some of the
ensuing problems.

The dimensionality of the thesaurus space is reduced by using a singular value decomposition. The
closeness of terms with equal frequency occurs because the terms have about the same number of zero entries in
their term vectors. For a given term, singular value decomposition assigns values to all dimensions of the
space, so that frequent and infrequent terms can be close in the reduced space if they occur with similar terms.
For example, the word "accident," which may occur 2590 times, and the word "mishaps," which may occur only
129 times, may have similar vectors that are close despite the frequency difference between them. The
technique of singular value decomposition (SVD) is used to achieve a dimensional reduction by obtaining a compact
and tractable representation for search purposes. The uniform representation for words and documents
provides a simple and elegant user interface for query focusing and expansion.

After forming the thesaurus vectors, a context vector for each document is computed. The context vector
is a combination of the weighted sums of the thesaurus vectors of all the words contained in the document.
These context vectors then induce a similarity measure on documents and queries that can be directly
compared to standard vector-space methods.

Another application for the thesaurus vectors is to divide a query into sub-topics. A vector for each query
sub-topic is formed and compared to the document vectors. The document vectors are then scored and ranked
by the degree to which they simultaneously match the sub-topics of the query.

The present invention will be described further, by way of examples, with reference to the accompanying
drawings, in which:

Figure 1 is a block diagram of an apparatus for determining lexical co-occurrence of terms within a document
or query;

Figure 2 is a flow diagram of the Buckshot clustering algorithm;

Figure 3 shows the query formulation using WordSpace;

Figure 4 shows the query formulation using DocumentSpace;

Figure 5 is a flow diagram for computing the word vectors for the thesaurus;

Figure 6 shows the Matrix A computed in the flow diagram of Figure 5;

Figure 7 shows the Matrix B computed in the flow diagram of Figure 5;

Figure 8 shows the Matrix C computed in the flow diagram of Figure 5;

Figure 9 shows the reduced Matrix C computed in the flow diagram of Figure 5;

Figure 10 is a flow diagram for computing context vectors for documents;

Figure 11 shows a document context vector;

Figure 12 is a flow diagram for ranking the documents based on the query context vector and the document
context vectors;

Figure 13 shows the memory locations for the ranking of documents;

Figure 14 is a flow diagram for forming factor clusters of document vectors;

Figure 15 shows the memory locations for the factor clusters;

Figure 16 is a flow diagram for ranking the document vectors based on the factor clusters;

Figure 17 shows the memory locations for the ranking of document vectors by factor clusters;

Figure 18 shows the final ranking of documents based on the factor cluster ranking; and

Figure 19 is a graph showing the precision points computed by the context vector and the factor cluster
vector methods.

Figure 1 is a block diagram of a preferred embodiment according to the invention of an apparatus 10 for
determining lexical co-occurrence of terms within a document or query and forming a thesaurus. The apparatus
10 includes a user input device 12 which includes, for example, one or more of an image scanner (optical or
electronic), a keyboard, a touchscreen, a mouse, etc. The image scanner can be a stand-alone device, or part
of a facsimile machine or document reproducing machine (copier). A processor 16 is connected to the input
device 12 for processing the document image into co-occurrence vectors and comparing the vectors. Processor
16 operates to perform these functions in accordance with operating programs read from read only memory
(ROM) 20, and by using random access memory (RAM) 22. Documents can be stored or retrieved from files
24. Processing results and user input information can be monitored on a CRT display monitor 14. After
processor 16 has completed processing the documents, the results can be output to an output device 18, which
includes, for example, a storage means (hard or floppy disk), a printer, a photocopier, a facsimile machine or
a CRT display.

The first preferred embodiment of the invention forms the lexical co-occurrence based thesaurus. Each
term of the documents is associated with a vector that represents the term's pattern of local co-occurrences.
This vector can then be compared with others to measure the co-occurrence similarity, and hence semantic
similarity of terms.

In information retrieval systems such as SMART and SIRE, documents and queries are represented as
vectors in term space. See Salton et al.; "Introduction to Modern Information Retrieval"; McGraw-Hill, New
York; 1983. These vectors can be represented as an encoding scheme of order-0, which is defined as:
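Order-0 encoding for words:

$$\phi_0 : V \to \mathbb{R}^{|V|} \qquad (1)$$

Order-0 encoding for documents:

$$\psi_0 : D \to \mathbb{R}^{|V|}, \qquad \psi_0(d_j) = \sum_{k=1}^{|d_j|} \phi_0(t_{j,k}) \qquad (2)$$

where

$V$ - a set of words;

$w_i$ - a word i in $V$;

$\phi$ - word encoding;

$D$ - a set of documents;

$d_j$ - a document j in $D$;

$\psi$ - document encoding;

$|d_j|$ - the number of tokens in $d_j$; and

$t_{j,k}$ - the k-th token in document $d_j$.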

The word encoding function $\phi_0$ maps words into a multidimensional space with as many dimensions as
there are words. There is no overlap between word vectors. In the order-0 encoding, each word corresponds
to a vector with exactly one non-zero weight: one entry for its own dimension. In the simplest case, this weight
is one. The representation $\psi_0(d_j)$ is computed for document $d_j$ by summing up the vectors of all tokens
occurring in it.
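
A minimal sketch of the order-0 encoding follows, assuming the simplest case in which each non-zero weight is one. The function names `order0_word` and `order0_document` and the four-word vocabulary are hypothetical and serve only as an illustration.

```python
def order0_word(word, vocab):
    """One-hot vector: a single weight of 1 in the word's own dimension."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def order0_document(tokens, vocab):
    """Sum of the order-0 vectors of all tokens occurring in the document."""
    doc = [0] * len(vocab)
    for tok in tokens:
        for i, x in enumerate(order0_word(tok, vocab)):
            doc[i] += x
    return doc

vocab = ["court", "judge", "lawsuit", "litigation"]   # toy vocabulary
doc_vec = order0_document(["judge", "court", "judge"], vocab)
print(order0_word("judge", vocab), doc_vec)  # prints: [0, 1, 0, 0] [1, 2, 0, 0]
```

The document vector has as many dimensions as there are words in the vocabulary, and its entries are simply token counts.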

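Similarity between the vector representations for words is measured by the cosine function:

$$\cos(\phi_0(w_i), \phi_0(w_j)) = \frac{\sum_k \phi_0(w_i)_k \, \phi_0(w_j)_k}{\sqrt{\sum_k \phi_0(w_i)_k^2 \; \sum_k \phi_0(w_j)_k^2}} \qquad (3a)$$

Equation 3a is used to determine topical or semantic similarities between two words.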

Similarity between the vector representations for documents is measured by a similar cosine function:

$$\cos(\psi_0(d_i), \psi_0(d_j)) = \frac{\sum_k \psi_0(d_i)_k \, \psi_0(d_j)_k}{\sqrt{\sum_k \psi_0(d_i)_k^2 \; \sum_k \psi_0(d_j)_k^2}} \qquad (3b)$$



Queries may be viewed as short documents and hence may also be represented as vectors. Search
proceeds by searching near neighbors to the query vector in document space. The assumption is that queries
and documents are similar to the extent that they contain the same words. However, the same content can be
expressed with very different words. Thus, a query and a document could have a similarity measure of zero
in this simple scheme even though the query content can be understood as a reasonable description of the
topic of the document. This is the problem of synonymy of words. The scheme of order-0 fails to capture
synonymy.
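
The synonymy problem can be reproduced in a few lines. The sketch below assumes a toy four-word vocabulary and hand-written count vectors; the helper name `cosine` is hypothetical. A query using "lawsuit" and a document using "litigation" share no dimension, so their cosine is zero.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Dimensions: court, judge, lawsuit, litigation (toy order-0 count vectors)
query_vec = [0, 0, 1, 0]   # query: "lawsuit"
doc_vec = [1, 1, 0, 2]     # document: "litigation court judge litigation"

sim = cosine(query_vec, doc_vec)
print(sim)  # prints: 0.0
```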

Another problem is that a query and a document can share ambiguous words. Thus, the word may be used
in a different sense in the document than in the query. In this case, the query and the document may have a
high degree of similarity according to a measurement of the cosine function of equation 3b even though the
query and the document do not overlap in the intended topic.

The reason for the discrepancy between vector similarity and actual similarity in word content is that the
absence or presence of a given word provides little information if words are treated as unanalyzed symbols
or indices in term vectors.

In the order-0 scheme, the word vector representation was constructed from one entry for the word. The
generalization of order-1 representations is to construct a word vector representation from entries for the
word's neighbors in the document collection. These vectors can be represented as an encoding scheme of
order-1, which is defined as:

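Order-1 encoding for words:

$$\phi_1 : V \to \mathbb{R}^{|V|}, \qquad \phi_1(w_i) = \sum_{d_j \in D} \; \sum_{k \,:\, t_{j,k} = w_i} \; \sum_{l \,:\, |k - l| < W} \phi_0(t_{j,l}) \qquad (4)$$

Order-1 encoding for documents:

$$\psi_1 : D \to \mathbb{R}^{|V|}, \qquad \psi_1(d_j) = \sum_{k=1}^{|d_j|} \phi_1(t_{j,k}) \qquad (5)$$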

A vector of order-1 for word $w_i$ is the sum of the vectors of all tokens $t_{j,l}$ that are neighbors of one of the tokens
$t_{j,k}$ of $w_i$ in one of the documents $d_j$ in the collection D. Being a neighbor is defined as occurring at a distance
of less than W intervening words. A reasonable setting of W is 50 words. So a word is represented as the sum
of its neighbors. To make the similarity between the two approaches explicit, an occurrence of a word is defined
to be its own neighbor of order-0 and ordinary neighbors are defined to be neighbors of order-1.
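
The order-1 construction can be sketched as follows. The function name `order1_vectors`, the two toy documents, and the window convention (tokens at most `window` positions away, excluding the token itself) are assumptions for illustration; the exact boundary of the window may differ from the definition above.

```python
def order1_vectors(documents, vocab, window):
    """Order-1 word vectors: for every token of a word, add the one-hot
    (order-0) vectors of the tokens within `window` positions of it."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0] * len(vocab) for w in vocab}
    for tokens in documents:
        for k, word in enumerate(tokens):
            lo, hi = max(0, k - window), min(len(tokens), k + window + 1)
            for l in range(lo, hi):
                if l != k:  # a token is not counted as its own order-1 neighbor
                    vectors[word][index[tokens[l]]] += 1
    return vectors

vocab = ["court", "judge", "lawsuit", "litigation"]
documents = [["litigation", "court", "judge"],
             ["lawsuit", "court", "judge"]]
vectors = order1_vectors(documents, vocab, window=2)
print(vectors["litigation"], vectors["lawsuit"])  # prints: [1, 1, 0, 0] [1, 1, 0, 0]
```

In this toy corpus "litigation" and "lawsuit" never co-occur, yet they receive identical order-1 vectors because they share the neighbors "court" and "judge".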

Document vectors of order-0 and order-1 are derived in a completely parallel fashion by summation from
their respective word vectors.

Synonymous terms have similar neighbors and hence will contribute a similar "direction" in the
multidimensional space of document vectors. Ambiguous terms have two different sorts of neighbors. In computing
a document vector, those terms that correspond to the sense used in the document will be reinforced whereas
the direction represented by the inappropriate sense will not be present in other words. Therefore, order-1
representations can be expected to alleviate the problems of synonymy and ambiguity.
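
As a hedged illustration of this effect, the sketch below sums hand-written order-1 word vectors into document vectors and compares a query with a document that uses only a synonym of the query word. The vector values and the names `document_vector` and `cosine` are hypothetical.

```python
import math

# Toy order-1 word vectors over the dimensions (court, judge, lawsuit, litigation);
# the values are illustrative, not taken from a real corpus.
order1 = {
    "court": [0, 2, 1, 1],
    "judge": [2, 0, 1, 1],
    "lawsuit": [1, 1, 0, 0],
    "litigation": [1, 1, 0, 0],
}

def document_vector(tokens, word_vectors):
    """Order-1 document vector: sum of the order-1 vectors of its tokens."""
    dims = len(next(iter(word_vectors.values())))
    total = [0] * dims
    for tok in tokens:
        for i, x in enumerate(word_vectors[tok]):
            total[i] += x
    return total

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

query_vec = document_vector(["lawsuit"], order1)
doc_vec = document_vector(["litigation", "judge"], order1)
sim = cosine(query_vec, doc_vec)   # about 0.82, although no word is shared
```

With order-0 vectors the same query and document would have a cosine of zero, since they have no word in common.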

Because the length of an order-1 representation is equal to the number of unique terms in the vocabulary,
the order-1 representations are dense and require a lot of storage, which is a severe disadvantage. The
order-1 vector of a long document will hardly have any zeros since almost every word is a neighbor of one of the
document's terms somewhere in the corpus. In contrast, the order-0 vector of a document has about as many
entries as there are words in the document, so it can be stored and processed efficiently.

One solution to this problem is to perform a dimensionality reduction of the order-1 vectors by means of
a singular value decomposition, which is disclosed in Deerwester et al. cited above. It can be used to find a
linear approximation of the original high-dimensional space (one dimension for each word) in an r-dimensional,
reduced space, for an r on the order of $10^2$. The properties of singular value decomposition guarantee that
vector similarity in the original space is preserved in the reduced space.

For the purpose of the singular value decomposition, all order-1 representations of the vocabulary V are
collected into a $|V| \times |V|$ matrix X such that row i of matrix X contains the order-1 vector of word i, i.e., $\phi_1(w_i)$.
Any rectangular matrix (including square matrices such as matrix X) can be decomposed into the product of
three matrices:

$$X = T_0 S_0 D_0' \qquad (6)$$

such that matrices $T_0$ and $D_0$ have orthonormal columns and matrix $S_0$ is diagonal. This is called the singular
value decomposition of matrix X. Matrices $T_0$ and $D_0$ are the matrices of left and right singular vectors,
respectively, and matrix $S_0$ is the diagonal matrix of singular values. Singular value decomposition (SVD) is unique
up to certain row, column and sign permutations. By convention the diagonal elements of matrix $S_0$ are
constructed to be all positive and ordered in decreasing magnitude.

In general, for $X = T_0 S_0 D_0'$ the matrices $T_0$, $D_0$, and $S_0$ must all be of full rank. Singular value decomposition
allows a simple strategy for optimal approximate fit using smaller matrices. If the singular values in matrix $S_0$
are ordered by size, the first k largest may be kept and the remaining smaller ones set to zero. The product of
the resulting matrices is a matrix which is only approximately equal to matrix X, and is of rank k. It can be shown
that the new matrix is the matrix of rank k which is closest in the least squares sense to matrix X. Since zeros
were introduced into matrix $S_0$, the representation can be simplified by deleting the zero rows and columns
of matrix $S_0$ to obtain a new diagonal matrix S, and then deleting the corresponding columns of matrices $T_0$
and $D_0$ to obtain matrices T and D, respectively. The result is a reduced model:

$$X \approx \hat{X} = T S D' \qquad (7)$$

which is the rank-k model with the best possible least-squares fit to matrix X.
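
The truncation to the k largest singular values can be sketched without any numerical library. The code below swaps in power iteration with deflation, which is only suitable for tiny matrices and is not the large-scale SVD procedure referred to in the text; the function names, the start vector, and the 2 x 2 example matrix are assumptions. It also assumes that k does not exceed the rank of X.

```python
import math

def matvec(A, x):
    """Multiply matrix A (a list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def top_singular_triplet(A, iters=100):
    """Power iteration on A'A; returns (sigma, u, v) for the largest singular value."""
    At = [list(col) for col in zip(*A)]
    # Non-uniform start vector; it must not be orthogonal to the top right singular vector.
    v = [float(i + 1) for i in range(len(A[0]))]
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    return sigma, [x / sigma for x in Av], v

def rank_k_approximation(X, k):
    """Sum of the k largest singular triplets, found by repeated deflation."""
    rows, cols = len(X), len(X[0])
    residual = [[float(x) for x in row] for row in X]
    approx = [[0.0] * cols for _ in range(rows)]
    singular_values = []
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        singular_values.append(sigma)
        for i in range(rows):
            for j in range(cols):
                approx[i][j] += sigma * u[i] * v[j]
                residual[i][j] -= sigma * u[i] * v[j]
    return approx, singular_values

X = [[2, 2], [1, -1]]   # singular values are sqrt(8) and sqrt(2)
X_hat, sigmas = rank_k_approximation(X, k=1)
# X_hat is approximately [[2, 2], [0, 0]], the closest rank-1 matrix to X
```

Keeping only the largest singular value retains the dominant direction of X and discards the remainder, which is the reduction expressed by equation 7.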

The approximation of X in the k-dimensional space amounts to a dimension reduction from the original
|V|-dimensional to the k-dimensional space. The new lower dimensional representations for words are the rows
of matrix T. Row i of matrix T is the reduced vector representation of word $w_i$. So all computations can now be
done efficiently in the reduced space. Thus, more compact word representations of order-0 and order-1 are
as follows:

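Compact order-0 encoding:

$$\phi'_0 : V_N \to \mathbb{R}^{N} \qquad (8)$$

Compact order-1 encoding:

$$\phi'_1 : V_N \to \mathbb{R}^{N} \qquad (9)$$

Reduced order-1 encoding:

$$\phi''_1 : V_N \to \mathbb{R}^{r} \qquad (10)$$

where $V_N$ is the set of the N most frequent content words; $v_n$ is word n in $V_N$; and r is the number of dimensions of the reduced space.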

A singular value decomposition is an expensive operation. Therefore, compact representations can only
be derived for a small part of the vocabulary using singular value decomposition. Let N be the number of words
for which a singular value decomposition is feasible given the computational means available in a particular
setting. For a large workstation, N is about 5,000. Then the derivation of compact representations for the whole
vocabulary proceeds in two steps.

In the first step, an order-1 representation is derived using a restricted vocabulary of N words, and a
singular value decomposition computes a low-dimensional encoding. The encodings $\phi'_1$ and $\phi''_1$ are calculated
using equations 8-10. The definitions for encoding words are analogous to those in equations 1-2 and 4-5 except
that the restricted vocabulary $V_N$ containing the N most frequent content words is used.

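In the second step, the representations for the total vocabulary are computed by summing up the reduced
order-1 vectors of all neighbors of a given word. The following equations represent order-2 encoding.

Order-2 encoding for words:

$$\phi_2 : V \to \mathbb{R}^{r}, \qquad \phi_2(w_i) = \sum_{d_j \in D} \; \sum_{k \,:\, t_{j,k} = w_i} \; \sum_{l \,:\, |k - l| < W,\ t_{j,l} \in V_N} \phi''_1(t_{j,l}) \qquad (11)$$

Order-2 encoding for documents:

$$\psi_2 : D \to \mathbb{R}^{r}, \qquad \psi_2(d_j) = \sum_{k=1}^{|d_j|} \phi_2(t_{j,k}) \qquad (12)$$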



Since order-1 vectors are based on word neighbors, the representation for a word $w_i$ that is derived in this
way contains information about the neighbors of the neighbors of $w_i$ in the document collection. Therefore, it
is called a representation of order-2. Similar to the order-0 and order-1 functions ($\psi_0$ and $\psi_1$), the order-2 function
$\psi_2(d_j)$ is computed by summing up the vectors of the tokens occurring in $d_j$. The order-2 representations may
contain slightly more information than the order-1 representations for tasks like resolving synonymy and
ambiguity, but the iteration is mainly done to restrict the vocabulary to a size that can be managed by the
dimensionality reduction.
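
The second step can be sketched as follows, assuming the reduced order-1 vectors of the restricted vocabulary are already available. The dictionary `reduced_order1` with r = 2, the toy documents, the window convention, and the function names are all hypothetical.

```python
def order2_vectors(documents, reduced_order1, window):
    """Order-2 word vectors: for each token, add the reduced order-1 vectors
    of those neighbors that belong to the restricted vocabulary."""
    r = len(next(iter(reduced_order1.values())))
    vectors = {}
    for tokens in documents:
        for k, word in enumerate(tokens):
            vec = vectors.setdefault(word, [0.0] * r)
            lo, hi = max(0, k - window), min(len(tokens), k + window + 1)
            for l in range(lo, hi):
                neighbor = tokens[l]
                if l != k and neighbor in reduced_order1:
                    for d in range(r):
                        vec[d] += reduced_order1[neighbor][d]
    return vectors

def order2_document(tokens, word_vectors):
    """Order-2 document vector: sum of the order-2 vectors of its tokens."""
    r = len(next(iter(word_vectors.values())))
    total = [0.0] * r
    for tok in tokens:
        for d, x in enumerate(word_vectors[tok]):
            total[d] += x
    return total

# Hypothetical reduced order-1 vectors for a restricted vocabulary of three words
reduced_order1 = {"court": [1.0, 0.0], "judge": [0.8, 0.2], "engine": [0.0, 1.0]}
documents = [["tort", "court", "judge"], ["carburetor", "engine"]]

vectors = order2_vectors(documents, reduced_order1, window=2)
doc_vec = order2_document(["tort", "carburetor"], vectors)
```

Here "tort" and "carburetor" lie outside the restricted vocabulary, yet each receives an r-dimensional vector from its frequent neighbors, which is how the whole vocabulary is covered without a larger decomposition.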

The sum of all words in a document is a good topic descriptor for short documents. However, long
documents tend to contain words from different topics. If too many topics enter in the computation of a document
vector, then the document vector will be in a region that is at an intermediate distance to all its topics, but not
particularly close to any of them. Thus, any fine distinctions between topics are lost. More precisely, the space
of word vectors can be viewed as the surface of a partial hypersphere in a multidimensional space that is
centered around the global centroid. In computing a document vector according to the summation formula of
equations 11 and 12, each addition of a word that is not related to the document's major topic, and therefore located
"on the opposite side" of the global centroid, will push it towards that centroid. Vectors in the area of the global
centroid are equally close to everything. Therefore, these vectors are of limited utility in information retrieval.
This problem is not particular to higher-order representations.
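
The pull toward the centroid can be seen in a two-dimensional toy example. The topic directions `legal` and `automotive` and the helper names are hypothetical; real word vectors would have many more dimensions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sum_vectors(vectors):
    """Component-wise sum, as in the summation formula for document vectors."""
    return [sum(col) for col in zip(*vectors)]

legal = [1.0, 0.0]        # hypothetical word vector for a legal topic
automotive = [0.0, 1.0]   # hypothetical word vector for an automotive topic

focused_doc = sum_vectors([legal] * 4)                    # one topic only
mixed_doc = sum_vectors([legal] * 2 + [automotive] * 2)   # two topics mixed

print(cosine(focused_doc, legal))                         # prints: 1.0
print(round(cosine(mixed_doc, legal), 3),
      round(cosine(mixed_doc, automotive), 3))            # prints: 0.707 0.707
```

The mixed document ends up equally close to both topics and particularly close to neither, which is the loss of distinction described above.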

A computationally simple approach to segment long documents is to cluster the set of tokens in a document
into a set of coherent subtopic clusters. A token is a single item in full text. For example, the word "the" appears
several times in a document as a token. A linear-time clustering algorithm such as Buckshot (Cutting et al. 1992)
can be used. See Cutting et al.; "Scatter-gather: A Cluster-Based Approach to Browsing Large Document
Collections"; Proceedings of SIGIR 1992. See also U.S. Application No. 07/790,316 to Pedersen et al., which is
incorporated herein by reference.

The Buckshot method, which employs three subprocedures, will be described briefly. The first
subprocedure, truncated group average agglomerative clustering, merges disjoint document sets, or groups, starting with
individuals until only k groups remain. At each step the two groups whose merger would produce the least
decrease in average similarity are merged into a single new group.

The second subprocedure determines a trimmed sum profile from selected documents closest to a document
group centroid. Given a set of k document groups that are to be treated as k centers for the purpose of
attracting other documents, it is necessary to define a centroid for each group. Documents far from the
centroid are trimmed out to determine better focused centers, and hence to more accurately assign individual
documents in the third subprocedure.

The third subprocedure assigns individual documents to the closest center represented by one of these
trimmed sum profiles.

Referring to Figure 2, the steps of the Buckshot method are shown. In step 30, a random sample C' is
constructed from corpus C of [...]. The sampling is performed without replacement. In step 32, sample C' is
partitioned into k groups by truncated group average agglomerative clustering. This partition is called
partition G. In step 34, partition P is constructed from corpus C by assigning each individual document to one
of the centers in partition G. This is accomplished by applying assign-to-nearest over the corpus C and the k
centers of partition G. In step 36, partition G is replaced with partition P. Steps 34 and 36 are repeated once.
In step 38, partition P is the new corpus.
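The flow of Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of the cited application: documents are plain lists of floats, similarity is the cosine, the merge criterion is simplified to "merge the pair whose union has the highest average pairwise similarity", the trimmed sum profile is replaced by an untrimmed centroid, and `sample_size` is left to the caller. All function names are hypothetical.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    # Untrimmed mean vector (assumption: the trimming step is omitted here)
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def avg_similarity(group):
    # Mean pairwise cosine inside a group; a singleton counts as 1.0
    if len(group) < 2:
        return 1.0
    pairs = [(i, j) for i in range(len(group)) for j in range(i + 1, len(group))]
    return sum(cosine(group[i], group[j]) for i, j in pairs) / len(pairs)

def group_average_clustering(vectors, k):
    # Start with individuals and merge until only k groups remain
    groups = [[v] for v in vectors]
    while len(groups) > k:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                score = avg_similarity(groups[i] + groups[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups

def assign_to_nearest(vectors, centers):
    groups = [[] for _ in centers]
    for v in vectors:
        nearest = max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
        groups[nearest].append(v)
    return groups

def buckshot(corpus, k, sample_size, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(corpus, sample_size)            # step 30: sample without replacement
    partition_g = group_average_clustering(sample, k)   # step 32
    partition_p = []
    for _ in range(2):                                   # steps 34 and 36, repeated once
        centers = [centroid(g) for g in partition_g if g]
        partition_p = assign_to_nearest(corpus, centers)  # step 34
        partition_g = partition_p                          # step 36
    return partition_p                                   # step 38

corpus = [[1, 0], [0.9, 0.1], [1, 0.05], [0, 1], [0.1, 0.9], [0.05, 1]]
clusters = buckshot(corpus, k=2, sample_size=4)
```

Only the small sample is clustered with the quadratic agglomerative step; the full corpus is touched only by the linear assign-to-nearest passes.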

After clustering the document, each document can then be described by the centroids of its subtopic
clusters. The derivation of the structured document representations of order-2 are:

Order-2 clustering function:

\[ \forall c_1, c_2 \in C,\ c_1 \neq c_2:\quad \gamma(d_j, c_1) \cap \gamma(d_j, c_2) = \emptyset \]
\[ \bigcup_{c \in C} \gamma(d_j, c) = \{\, t_{j,k} \mid 1 \le k \le |d_j| \,\} \tag{13} \]

Order-2 encoding function:

...(14)

where γ partitions the corpus of documents D into subtopic clusters C; and ψ2' assigns centroids of subtopic
clusters to documents. The partitioning function makes sure that words pertaining to different topics remain
separated in different clusters, thus avoiding document vectors that are too close to the global centroid.

An example of vector representation of each document will be next described by using the user interface
TwinSpaces. TwinSpaces uses the methods of thesaurus vectors and document vectors defined in this
application. TwinSpaces has two parts: WordSpace (generation of word vectors) and DocumentSpace (generation
of document vectors).

TwinSpaces is ideal for a user who has a clearly defined information need, but who may not be fully aware
of the vocabulary that is used for the topic of interest in the document collection. For example, a journalist
who is doing research on an article is only interested in documents related to the topic of the article. But
there are many pieces of information that are relevant, yet cannot be searched for, because the journalist
doesn't know about them yet. Thus TwinSpaces is ideal for this type of search, which is intermediate between
literal retrieval and simply browsing all of the documents.

WordSpace is based on a corpus of 50 million words corresponding to 18 months of the New York Times
newswire. By using the order-2 scheme, 55,000 word vectors are computed.

An alternative method is to compute vectors for 5,000 letter fourgrams instead of words in iteration 1 (for
ψ1). A fourgram is a sub-word fragment of four letters. For example, the word "package" is divided into fourgram
fragments of "pack", "acka", "ckag", and "kage". There are fewer fourgram combinations than there are words
in the English language. The dimensionality of the fourgram space was reduced to r = 97 using singular value
decomposition.
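The fourgram split can be illustrated with a sliding window of four letters. The helper name `fourgrams` is hypothetical and the sketch ignores any normalization (case folding, word boundaries) that an actual system might apply.

```python
def fourgrams(word, n=4):
    # Every contiguous run of n letters; words shorter than n yield no fragments
    return [word[i:i + n] for i in range(len(word) - n + 1)]

fragments = fourgrams("package")  # ['pack', 'acka', 'ckag', 'kage']
```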

The informational significance of individual terms can be evaluated by looking at their nearest neighbors.
Table 1 shows ten randomly selected words and their nearest neighbors in order-2 space: average (financial
markets); Eschenbach (classical music); gill (fishing); Indonesians (South East Asia); markup (retailing); Novell
(computer industry); pleaded (court proceedings); rags (fashion); shriveled (cooking) and Societies (political
organizations). The topical characteristics of the selected words are obvious from looking at the few nearest
neighbors given. For the words "average" and "rags," the user has to look at 30 neighbors to get a clear sense
of their topics (the financial markets and fashion, respectively). The neighbors of the word "Societies" suggest
that it is related to political and religious organizations, but no clear topic emerges. This indicates that the
word is an ambiguous and, consequently, less useful search term.

Higher-order representations for words thus make possible a novel type of user interface for information
access. In traditional information retrieval systems, it is hard to assess the impact of the terms used in a
query. The user communicates with the system on the level of document descriptions. In the case of relevance
feedback, one specifies which documents returned as response to the original query should be the basis for the
next search iteration. Communication in terms of documents is difficult. Document titles are often
uninformative or do not represent crucial parts of the content of a document. It also takes a relatively long
time to read and evaluate them with respect to the user's information needs.

In systems based on order-0 representations, the user can only assess the impact of search terms indirectly
by analyzing the retrieval results for varying search terms. In contrast, the higher-order word representations
provide a simple and elegant user interface for evaluating the significance of search terms. A look at
their nearest neighbors is usually sufficient to get a good idea of the topic that the word pertains to, as
demonstrated in Table 1.

With higher-order word representations, the user examines nearest neighbors as direct diagnostics for
each individual term, thereby composing a query that matches the information needs of the user more closely.
A word "shriveled" that may have been intended as a description of dry, old skin can be replaced by a related
word that is used in the document collection to express the intended concept. The user can also check whether
a proper name like "Eschenbach" is used for a specific person in the corpus (here it is the conductor Christopher
Eschenbach). If there were a tennis player of the same name, then it would make Eschenbach less useful in
a search for documents on classical music.

Query formulation and refinement in WordSpace uses the sample layout of Figure 3. A query is shown in
the section CURRENT QUERY. These words are collected in the pool of words to be examined in the section
called POOL. The user can add or delete words from this pool depending on the output of the search. The
commands shown in the MENU section are used to form and process the query in WordSpace.

The crucial operation of the user interface is retrieval of nearest neighbors, either a small number (e.g.
10) that is presented directly for inspection, or a large number that is clustered and presented to the user in
digested form. Individual words or the whole pool can be evaluated by looking at their nearest neighbors. The
nearest neighbors of the pool (tank, artillery) are printed in the section SEARCH RESULT. By inspecting these
neighbors, the user can make sure that the "receptacle" sense of the word "tank" will not interfere with a query
on military equipment.
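A minimal sketch of nearest-neighbor retrieval for a pool of words follows. It assumes that the pool is represented by the sum of its word vectors and that closeness is measured by the cosine; the two-dimensional vectors in `TOY_VECTORS` are invented purely for illustration and are not taken from WordSpace.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pool_vector(pool, vectors):
    # Assumption: a pool is summarized by the sum of its word vectors
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in pool:
        total = [t + x for t, x in zip(total, vectors[word])]
    return total

def nearest_neighbors(pool, vectors, n=10):
    query = pool_vector(pool, vectors)
    candidates = [w for w in vectors if w not in pool]
    candidates.sort(key=lambda w: cosine(query, vectors[w]), reverse=True)
    return candidates[:n]

# Invented toy vectors: first axis "military", second axis "plumbing"
TOY_VECTORS = {
    "tank": [0.7, 0.7],
    "artillery": [1.0, 0.1],
    "water": [0.1, 1.0],
    "missile": [0.9, 0.2],
    "pipes": [0.2, 0.9],
}

military = nearest_neighbors(["tank", "artillery"], TOY_VECTORS, n=1)
receptacle = nearest_neighbors(["tank", "water"], TOY_VECTORS, n=1)
```

With these toy values, pairing the ambiguous word "tank" with "artillery" pulls the pool toward "missile", while pairing it with "water" pulls it toward "pipes".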

A clearly ambiguous word like "tank" can be clarified to narrow the search. Table 2 shows ambiguity
resolution with word vectors of order-2. The nearest neighbors suggest that higher-order vectors deal with
ambiguity and synonymy to some extent, even without user interaction. The example of "tank" shows that the
information present in higher-order vectors can be used to resolve ambiguity, which is one of the main problems
for representations of order-0. In order-0 retrieval, the undesired sense of an ambiguous search term may lead
to the retrieval of documents that are not related to the query.

Table 2 shows that the word "tank" can be disambiguated both ways in WordSpace. A query consisting of
the terms "tank" and "water" retrieves only words relevant to the "receptacle" sense of the word "tank." The
query consisting of the terms "tank" and "artillery" is apparently located in a part of the space that corresponds
to the "armored vehicle" sense used in Figure 3.

An order-0 retrieval system will only do well on documents that contain both the ambiguous and the
disambiguating term, but it will give the same ranking to documents that contain only one of them (e.g. only
water or only tank). The nearest neighbors in Table 2 suggest that a higher-order system would remedy this
situation. If the word "tank" occurs in two documents, but water doesn't, the document containing words related
to water (such as pipes or flush) will be rated higher than the one pertaining to a different topic.

The generalization to higher-order representations also addresses the problem of synonymy. In order-0
retrieval, the occurrence of the word "coast" in one document and the use of its synonym "shoreline" in a related
document will not increase the similarity of the two documents. The higher-order scheme can exploit synonymy
since "coast" and "shoreline" have similar neighbors. Therefore, their higher-order vectors will contribute a
similar direction in the multidimensional space to their respective document vectors, which in turn makes the
document vectors more similar. Table 2 gives two examples of words whose synonyms have almost identical
direction in the multidimensional space.

DocumentSpace, which is the second part of TwinSpaces, contains 34,000 articles from the New York
Times newswire between the months of June and November of 1990. Each document vector was computed
according to equations 11 and 12, as the sum of the vectors of all its tokens.

Figure 4 shows the user interface for DocumentSpace, which is parallel to the user interface for WordSpace.
The user interface has functions for adding to and deleting from the pool and for looking at the nearest
neighbors of the pool or an individual document. The current query in WordSpace can be used to start a search
in DocumentSpace.

The query "Japanese American research" was performed using DocumentSpace. In the SEARCH RESULT
section of Figure 4, the ten top ranking documents are shown. These documents seem to conform well
with the query although there are few literal matches. In fact, a retrieval method based on literal matches would
do badly on a query that contains such general terms. Because of DocumentSpace's independence from literal
matches, it also does well even on one-word queries.

There is a benefit of the parallel design of WordSpace and DocumentSpace. For a query consisting of
words, the topic and content of a document can be described by its near neighbors in WordSpace, in complete
analogy to the retrieval of document neighbors in DocumentSpace. Table 3 shows the titles of five random
and one selected document. Table 4 displays the nearest neighbors of the articles displayed in Table 3. These
nearest neighbors show that the neighborhood of a document in the space of word vectors is a good
characterization of its topic for short, topically focused documents.

By reviewing the information in Tables 3 and 4, the user sees that documents 132, 14387, and 4579 are
one-topic documents that are represented by words that characterize their content. Documents 13609, 22872,
and 27081 are long documents with more than one topic. Therefore, their document vectors are closer to the
global centroid. Their nearest neighbors are function words, because function words share the characteristic
of having a large number of words from different topics as their neighbors.

A good characterization is achieved for all documents with the structured representations of equations 13
and 14. Such a characterization makes the user less dependent on well chosen document titles. It is easier to
process than a long title, particularly if only the coarse topic of a document is of interest. In general, the
nearest neighbors provide additional information on the content of a document that can be used for document
selection or relevance feedback.

The randomly selected words in Table 1 were used as queries in DocumentSpace to demonstrate the
information present in the word vectors of order-2. Appendix A shows the ten highest ranking documents for each
query term in Table 1. With the exception of Societies, the document titles generally correspond to the topic
described by the nearest neighbors of the query term. The cause for the religious articles retrieved for Societies
may be the use of this term in the phrase "Center for Religion and Human Rights in Closed Societies." But the
mismatch between word neighbors and document neighbors is only a problem if Societies is used as a search
term although its word vectors suggest a topically mixed neighborhood.

The first preferred embodiment described above is a rich representation of words and documents that is
based on global information about the document collection. The first preferred embodiment is superior to the
literal representation used in classical vector similarity search. In the invention, the representation of words
provides a user interface for understanding and manipulating word-word, word-document, and word-query
relations. The representation of documents gives rise to a similarity measure that reflects topical relatedness
better than a scheme based on literal matches.

The first preferred embodiment requires the formation of a co-occurrence-based thesaurus, which is
formed by computing and collecting a (symmetric) term-by-term matrix C. Each element c_ij of matrix C records
the number of times that words i and j co-occur in a window of size k. For example, the window k can be forty
or fifty words. Topical or semantic similarity between two words can then be defined as the cosine between
the corresponding columns of the matrix C as defined in equation 3a. The assumption is that words with similar
meanings will occur with similar neighbors if enough text material is available.
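The following sketch builds a small symmetric co-occurrence table and compares two of its columns by cosine. It rests on an assumption about the window: two tokens are taken to co-occur when their positions differ by less than `window`. The function names and the toy sentence are hypothetical.

```python
import math
from collections import defaultdict

def cooccurrence_counts(tokens, window):
    # counts[a][b] = number of times a and b fall within the same window
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            counts[word][other] += 1
            counts[other][word] += 1  # keep the table symmetric
    return counts

def column_cosine(counts, a, b):
    col_a, col_b = counts.get(a, {}), counts.get(b, {})
    dot = sum(col_a[t] * col_b.get(t, 0) for t in col_a)
    norm_a = math.sqrt(sum(c * c for c in col_a.values()))
    norm_b = math.sqrt(sum(c * c for c in col_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

tokens = "the coast was rocky the shoreline was rocky".split()
counts = cooccurrence_counts(tokens, window=3)
similarity = column_cosine(counts, "coast", "shoreline")
```

In this toy text "coast" and "shoreline" never fall in the same window, yet their columns are close because both words occur near "the", "was" and "rocky".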

The matrix C has v²/2 distinct entries, where v is the size of the vocabulary. Although this matrix is sparse,
v is expected to be very large. Therefore, the overall storage requirement needed to form the co-occurrence
thesaurus is unworkable.

Even if enough memory were found to represent the matrix C directly, the thesaurus vectors associated
with each word (columns of the matrix C) would be v-dimensional. Although these vectors are somewhat sparse,
this implies that word comparisons are an order v operation, which is prohibitively expensive for large scale
application.

Thus, the dimensionality of the problem must be reduced to a workable size by using a singular value
decomposition of a matrix of co-occurrence counts. However, this matrix must be constructed in a series of steps
to keep the computations tractable at each stage.

The construction of the thesaurus will be described with reference to Figures 5-9. The goal is to apply a
singular value decomposition to reduce the dimensionality of the matrix in a disciplined fashion and in the
process produce more compact representations. However, the time to perform the singular value decomposition
is proportional to n², where n is the dimensionality of the matrix. Thus, the dimensionality of the matrix fed
into singular value decomposition cannot be too high. In particular, the original matrix C cannot be used.
Instead, a two-stage computation is performed that derives two sets of topical word classes from the corpus:
200 word clusters of low coverage; and 200 word clusters of high coverage. Thus, the dimensionality of the
matrix is preserved without sacrificing too much information.

Topical word classes agglomerate information over similar words. In Figure 5, step 100 computes the
word and word pair frequencies in the corpus. A word pair is two consecutive words in the corpus. For
example, the title "The Journal of Computer Science" has four word pairs: The Journal; Journal of; of Computer;
and Computer Science.

In step 102, words and word pairs of low frequency are dropped from consideration. For example, any word
with a frequency of one and any word pair with a frequency of less than five are not important. The word pair
"computer science" will appear throughout the text. Therefore, it will probably appear more times than the other
three word pairs, which will only appear in the title.
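Steps 100 and 102 can be sketched with two counters. The thresholds follow the example in the text (drop words seen once, drop pairs seen fewer than five times); the function names, the lower-casing and the toy token stream are assumptions made for illustration.

```python
from collections import Counter

def word_and_pair_frequencies(tokens):
    # Step 100: frequencies of words and of consecutive word pairs
    words = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    return words, pairs

def drop_low_frequency(words, pairs, min_word=2, min_pair=5):
    # Step 102: keep only terms that meet the frequency thresholds
    kept_words = {w: c for w, c in words.items() if c >= min_word}
    kept_pairs = {p: c for p, c in pairs.items() if c >= min_pair}
    return kept_words, kept_pairs

title = "the journal of computer science".split()
title_pairs = list(zip(title, title[1:]))  # four word pairs

# Toy corpus: the title once, then "computer science" four more times
tokens = title + "computer science".split() * 4
words, pairs = word_and_pair_frequencies(tokens)
kept_words, kept_pairs = drop_low_frequency(words, pairs)
```

In the toy corpus only "computer", "science" and the pair "computer science" survive the filter, which mirrors the point made about the title.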

In step 104, Matrix A is formed from the original corpus. The full co-occurrence matrix is constructed
for a subset of terms in the corpus. For example, 3,000 medium frequency words (frequency ranks 2,000
through 5,000) are chosen for this subset. Figure 6 shows Matrix A with the dimensions of 3000 words by 3000
words. Element a_ij of the matrix records the number of times that words w_i and w_j co-occurred in a window of
40 words in the text collection.

Step 106 forms the first set of topical word classes by clustering Matrix A into groups. The clustering
algorithm is based on the cosine similarity between the columns of matrix A. For example, 200 A-classes g_A1,
g_A2, ..., g_A200 using group average agglomerative clustering are found. These 200 word clusters are considered
low coverage clusters. The Buckshot method is one example of a method to cluster the groups.

A second matrix B is formed in step 108 by considering a larger vocabulary subset. For each word in this
larger B-subset, matrix B records the number of times words in each A-class occur in neighborhoods around
that B-subset word. Each element b_ij records the number of times that w_j co-occurs with any of the
medium-frequency words from class g_Ai. This is similar to the usual co-occurrence matrix construction except
that the matrix is no longer symmetric. Figure 7 shows Matrix B, which has rows corresponding to A-classes and
columns corresponding to words. For example, the B-subset contains the 20,000 most frequent words, excluding
stop words.
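The construction of a word-by-class matrix such as Matrix B can be sketched as follows. The sketch assumes a given mapping `word_class` from clustered words to class indices and treats two tokens as neighbors when their positions differ by less than `window`; it computes a column for every token rather than restricting to a B-subset. Names and toy data are hypothetical.

```python
def class_cooccurrence(tokens, word_class, num_classes, window):
    # columns[w][i] = number of times w occurs near any word of class i
    columns = {}
    for i, word in enumerate(tokens):
        column = columns.setdefault(word, [0] * num_classes)
        lo = max(0, i - window + 1)
        hi = min(len(tokens), i + window)
        for j in range(lo, hi):
            neighbor = tokens[j]
            if j != i and neighbor in word_class:
                column[word_class[neighbor]] += 1
    return columns

# Invented classes: 0 = military terms, 1 = plumbing terms
word_class = {"radar": 0, "missile": 0, "pipes": 1, "flush": 1}
tokens = "navy radar missile plumber pipes flush".split()
columns = class_cooccurrence(tokens, word_class, num_classes=2, window=2)
```

Because a word only needs to appear near some member of a class, even a word that was never clustered itself ("navy", "plumber") receives a non-empty column.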

In step 110, this B-subset is again partitioned into 200 word classes by clustering the columns of matrix
B. The purpose of this second iteration is to ensure that each word in the corpus has a sufficient number of
neighbors from at least one word class. If only A-classes are used, then many words would have no
co-occurrence events. In contrast, every word co-occurs with several words in the B-subset and hence will have
many co-occurrence events with respect to B-classes. The 200 word clusters (g_B1, g_B2, ..., g_B200) generated
are of high coverage. The Buckshot method is one example of a method to cluster the groups.

In step 112, a third co-occurrence matrix C is collected for the full corpus vocabulary versus the B-classes.
Thus, a matrix is formed having the full vocabulary versus a representative of the full vocabulary. Element c_ij
contains the number of times that term j co-occurs in a window of k words with any word in class g_Bi. Referring
to Figure 8, matrix C has b rows and v columns. For example, all 176,116 words that occurred at least twice
in the collection are used and all 272,914 pairs of adjacent words that occurred at least 5 times, for a total of
449,030 unique terms.

At step 114, a singular value decomposition dimensionality reduction to p (p < b) is performed so that each
of the v terms can be represented as a compact p-dimensional vector and also to improve generalization. The
reduced matrix C is shown in Figure 9.

To reduce compute time in the example, only a subset of the matrix, corresponding to the 1000th through
6000th most frequent word, was decomposed. This decomposition defined a mapping from the 200-dimensional
B-class space to a 20-dimensional reduced space. By applying the mapping to each of the 449,030
200-component B-class vectors, a smaller 20-dimensional vector was computed for each word and pair.

Note that the final reduction in dimensionality was performed because smoothing and improved generality
result from a singular value decomposition reduction. Similarity between b-component vectors can be an
error-prone measure of semantic similarity since there may be several word classes with similar topics. For
example, class g_B4 contains words like "navy", "radar", and "missile", while some of the members of class g_B47
are "tanks", "missiles", and "helicopters". If one of two words has many neighbors in g_B4 and the other has many
in g_B47, then they would not be similar in the 200-dimensional space; but they are similar in the reduced space.
This is because the singular value decomposition algorithm recognizes and eliminates such redundancies.

Four passes through the corpus are required to complete the computation. The first pass computes word
and word pair frequencies. The second pass computes Matrix A and the A-classes. The third pass computes
Matrix B and the B-classes. Finally, the fourth pass computes Matrix C. In addition, Matrix C is decomposed
by using singular value decomposition to compute the thesaurus vectors.

If the Tipster Category B corpus is used, each pass through the corpus takes roughly six hours (includes
CPU and I/O time). Tipster is a corpus of documents controlled by the government (NIST, the National Institute
of Standards and Technology) to further information retrieval methods. Note that these computations could have
been accelerated by using loosely coupled coarse-grained parallelism to effect a linear reduction in compute
time. The singular value decomposition requires roughly 30 minutes to compute.

The net effect of this computation is to produce for each unique term a dense p-dimensional vector that
characterizes its co-occurrence neighborhoods. These vectors then define a thesaurus by associating each
word with its nearest neighbors.

An example of some of the associations found using the Tipster category B corpus are:

accident     repair; faulty; personnel; accidents; exhaust; equipped; MISHAPS; injuries; sites
advocates    passage; PROPONENTS; arguments; address; favoring; compromise; congress; favors; urge
litigation   LAWSUITS; audit; lawsuit; file; auditors; auditor; suit; sued; proceedings
tax          taxes; income tax; new tax; income taxes; taxpayers; incentives; LEVIES; taxpayer; corporate taxes
treatment    drugs; syndrome; administered; administer; study; administering; PROCEDURE; undergo; aids

Each row displays a word and its nine nearest neighbors. For example, "repair" is the nearest neighbor of
"accident". Word pairs used as terms are displayed without being separated by a semicolon. Words in upper case
are hand selected synonyms as might be found in a manually constructed thesaurus. They are particularly
interesting because they are unlikely to co-occur with their mates and hence illustrate that this thesaurus
construction effectively uses second-order co-occurrence (sharing neighbors in the corpus) rather than simple
first-order co-occurrence (occurring next to each other) to find synonyms.

The second preferred embodiment uses the computed thesaurus vectors to perform a search for relevant
documents. To use this information directly in the search, a similar representation for documents is needed.
The document vectors that are computed are called "context vectors." The simplest approach is to represent
each document by a vector, which is the sum of the thesaurus vectors for the words in its text. Formally,

\[ \vec{d}_j = \sum_{i} w_{ij}\, \vec{v}_i \tag{15} \]

where d_j is the vector for document j; w_ij is the weight for word i in document j; and v_i is the thesaurus
vector for word i. Queries may be represented as vectors by using equation 15.

An example of weighting the words in the document is by using an augmented tf.idf method (term
frequency-inverse document frequency method) when summing thesaurus vectors:

\[ w_{ij} = tf_{ij} \cdot \log\frac{N}{n_i} \tag{16} \]

where tf_ij is the frequency of word i in document j; N is the total number of documents; and n_i is the document
frequency of word i. As the word frequency increases in a document, the weight (score) for that word also
increases. However, the term N/n_i is inversely proportional to document frequency such that words occurring
in many documents receive less weight.

For example, the frequency of the word "the" is high for a document. Therefore, this word has a high weight
for the document. However, this word occurs in almost every document of the corpus. Thus, the weight of the
word drops because it has a low inverse document frequency.
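A small sketch of the weighted summation follows. It assumes the weight has the common form tf × log(N/n), which matches the definitions of tf, N and n given above but should be read as an assumption about the exact formula; the thesaurus vectors and document frequencies are invented toy values, and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(doc_tokens, doc_freq, num_docs):
    # Assumed weighting: term frequency times log(N / document frequency)
    tf = Counter(doc_tokens)
    return {w: tf[w] * math.log(num_docs / doc_freq[w]) for w in tf if w in doc_freq}

def context_vector(doc_tokens, thesaurus, doc_freq, num_docs):
    # Weighted sum of the thesaurus vectors of the words in the document
    dims = len(next(iter(thesaurus.values())))
    vector = [0.0] * dims
    for word, weight in tfidf_weights(doc_tokens, doc_freq, num_docs).items():
        if word in thesaurus:
            vector = [x + weight * t for x, t in zip(vector, thesaurus[word])]
    return vector

# Invented toy data: "the" occurs in all 100 documents, "tank" in 10
thesaurus = {"the": [1.0, 1.0], "tank": [1.0, 0.0]}
doc_freq = {"the": 100, "tank": 10}
doc = ["the", "the", "the", "tank"]

weights = tfidf_weights(doc, doc_freq, num_docs=100)
vector = context_vector(doc, thesaurus, doc_freq, num_docs=100)
```

Although "the" is the most frequent token in the toy document, its weight is zero because it occurs in every document, so the context vector is determined by "tank" alone.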

The context vectors d_j depend only on the underlying thesaurus vectors. Thus, this method is automatic.
The document vectors are a derivation from the corpus. Therefore, the chances that the representations are
tuned to the relevant topics increase.

Figure 10 shows the process of computation of context vectors. In step 200, the query or document is
loaded into the processor 16 (see Figure 1). All of the words in the query or document are extracted in step 202.
In step 204, the thesaurus vector is retrieved for the first word extracted. The thesaurus vector is added to the
context vector for the document in step 206. If there are more words to process from the document, then the
flow returns to step 204 to retrieve the thesaurus vector for the next word. If all the words are processed in
step 208, then the context vectors are normalized in step 210. The equation to normalize the context vector
is:

\[ \psi(d_j)_i \leftarrow \frac{\psi(d_j)_i}{\sqrt{\sum_{k=1}^{p} \psi(d_j)_k^{2}}} \tag{17} \]

where p is the number of dimensions in the reduced space; and ψ(d_j) is the context vector. By normalizing the
context vectors, all of the context vectors will have the same length regardless of the size of the document.

Figure 11 shows the normalized context vector for the query or document. Each context vector has 20
real-valued dimensions.
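Normalization to a common length can be sketched as division by the Euclidean norm, assuming that unit length is the target. The function name `normalize` and the sample vectors are hypothetical.

```python
import math

def normalize(vector):
    # Scale the vector to unit Euclidean length; a zero vector is returned unchanged
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)

short_doc = normalize([3.0, 4.0])    # summed from a few words
long_doc = normalize([30.0, 40.0])   # same direction, ten times as many words
```

Both toy documents end up with the same unit-length vector, so document size no longer influences the comparison.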

Figure 12 shows the process of using context vectors to retrieve relevant documents for a query. In step
230, a query is entered into the processor by the user. The processor computes the context vector for the query
in step 232 using the flow diagram of Figure 10. The computed context vectors are stored by the processor in
step 234. The documents are retrieved by the processor in step 240. This step can be performed before or in
parallel with the query processing. The context vectors for each document are computed in step 242. The
context vectors for each document are stored in RAM or a permanent storage system in step 246 to be used for
additional searches.

In step 250, the correlation coefficient is computed based on the context vector of the query and the context
vectors of the corpus of documents. The correlation coefficient is computed using the cosine function
described earlier (see equation 3). However, because the context vectors are normalized, the denominator is
equal to one. Thus, the equation for the correlation coefficient is:

\[ \mathrm{corr}(d_i, d_j) = \sum_{k=1}^{p} \psi(d_i)_k\, \psi(d_j)_k \tag{18} \]

where d_i is the query vector and d_j is the document vector.

After calculating all of the correlation coefficients, the documents are ranked in step 252 from most
relevant to least relevant. The ranking can be displayed on the monitor or printed on the printer. In the
alternative, the ranking can be stored in a RAM or permanent storage device. Referring to Figure 13, each memory
location stores the rank, the document identification and the correlation coefficient.
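Steps 250 and 252 can be sketched as a dot product over normalized vectors followed by a sort. The document identifiers and vectors are invented for illustration, and the record layout (rank, document identification, correlation) simply mirrors the description of the stored ranking.

```python
def correlation(query_vector, doc_vector):
    # For unit-length vectors the cosine reduces to the dot product
    return sum(q * d for q, d in zip(query_vector, doc_vector))

def rank_documents(query_vector, doc_vectors):
    scored = sorted(
        ((correlation(query_vector, vec), doc_id) for doc_id, vec in doc_vectors.items()),
        reverse=True,
    )
    # One record per document: (rank, document identification, correlation)
    return [(rank, doc_id, score) for rank, (score, doc_id) in enumerate(scored, start=1)]

query = [1.0, 0.0]
documents = {"d1": [0.6, 0.8], "d2": [1.0, 0.0], "d3": [0.0, 1.0]}
ranking = rank_documents(query, documents)
```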

The process described in Figure 12 was performed using the Tipster category B corpus, which has over
450,000 unique terms from about 173,000 Wall Street Journal articles. The context vectors were computed
for the 25 Category B topics of the Tipster collection. For each query, documents were ranked according to
vector similarity as computed by the correlation coefficient and precision/recall statistics collected. The results
of the invention were compared against a baseline standard vector space similarity search with augmented
tf.idf term weighting.

To achieve better results than previous methods, schemes that combine the scores from the tf.idf baseline
and context vectors were used. Formally, document ranks of the following form are considered:

\[ r' = a \cdot r_{tf.idf} + (1 - a) \cdot r_{cv} \tag{19} \]

where r_cv is the context vector rank; r_tf.idf is the tf.idf rank; and a is a free parameter between 0 and 1.
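The linear combination of the two ranks can be sketched as below. The parameter `alpha` stands for the free parameter a, ranks are assumed to be positive integers with 1 as the best rank, and the toy ranks are invented.

```python
def combined_rank(rank_tfidf, rank_cv, alpha):
    # r' = alpha * r_tfidf + (1 - alpha) * r_cv, computed per document
    return {doc: alpha * rank_tfidf[doc] + (1 - alpha) * rank_cv[doc] for doc in rank_tfidf}

def order_by_rank(ranks):
    # Smaller combined rank means more relevant
    return sorted(ranks, key=ranks.get)

rank_tfidf = {"a": 1, "b": 2, "c": 3}
rank_cv = {"a": 3, "b": 1, "c": 2}

mostly_tfidf = order_by_rank(combined_rank(rank_tfidf, rank_cv, alpha=0.7))
only_cv = order_by_rank(combined_rank(rank_tfidf, rank_cv, alpha=0.0))
```

Setting `alpha` to 1 reproduces the tf.idf ordering and setting it to 0 reproduces the context vector ordering; values in between trade one off against the other.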

Figure 19 shows a precision graph for 11 points of recall. First, a search is performed to retrieve possibly
relevant documents. The documents are analyzed to determine the number that are actually relevant to the
query. The precision of the search is the ratio of the number of relevant documents retrieved to the total number
of retrieved documents. The recall of the search is the ratio of the number of relevant documents retrieved to
the number of relevant documents in the corpus. Thus, as the recall of the search is increased, the precision of
the search typically decreases.
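Precision and recall for a single search can be computed as in the following sketch. The document identifiers are invented, and the function name is hypothetical.

```python
def precision_recall(retrieved, relevant):
    # hits = relevant documents that were actually retrieved
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3).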

In the graph of Figure 19, the bottom line uses the recall points for tf.idf. The middle line uses the recall
points for the linear combination with the optimal choice of a, which is 0.7. The average precision for tf.idf is
0.271 and the average precision for the linear combination of tf.idf and context vectors is 0.300. The top line
shows the recall points for the word factorization method, which will be described next.

A third preferred embodiment uses the thesaurus vectors to analyze the query into topic-coherent word
groups, which are called word factors. The goal is to ensure that documents are relevant to the entire query
such that their score with respect to each factor is high. In addition, word factors may be manually screened
for relevance. Word factors containing nuisance or non-topical terms can be deleted from the query.

Figure 14 shows the process for query factorization. In step 300, the query is input into the processor. The
processor retrieves the thesaurus vectors for the words in the query in step 302. These retrieved thesaurus
vectors can be temporarily stored in the RAM. A clustering algorithm is used in step 304 to cluster the retrieved
thesaurus vectors based on different topics or factors. The clustered vectors are stored in the RAM in step
308 according to the relevant factor.

In Figure 15, the memory locations are divided into factors. The document vectors 310 are assigned to
the relevant factor and stored.

Figure 16 shows the retrieval of documents using the word factor method. In step 320, the three factors
of the query computed by the process shown in Figure 14 are retrieved. The factor vector for each factor cluster
is computed in step 322. The factor vectors for each factor cluster are generated using the following equation:

\[ \vec{f}_m = \sum_{i \in m} w_{im}\, \vec{v}_i \tag{20} \]

where f_m is the factor vector for cluster m; w_im is the weight for word i in cluster m; and v_i is the thesaurus
vector for word i. The factor vectors are stored in step 324.

In step 330, the documents of the corpus are retrieved into the processor. The document vectors for each
document are computed in step 332 by using equation 15. The document vectors are stored in step 334. In
step 340, the correlation coefficients between the computed document vectors and the factor vector are
computed by using the following equation:

\[ \mathrm{corr}(f_m, d_j) = \sum_{k=1}^{p} \psi(f_m)_k\, \psi(d_j)_k \tag{21} \]

where ψ(f_m) is the factor vector for factor cluster f_m and ψ(d_j) is the context vector for document d_j.

In step 342, the documents are ranked based on the correlation coefficient assigned and the appropriate
factor. The ranking of the documents within a factor is based on correlation:

$$\mathrm{corr}(f_m, d_j) \qquad (22)$$

The rank of $d_j$ according to this ranking is $r_m(j)$, and $\mathrm{corr}(f_m, d_j)$ is the correlation of factor m and document j.
Figure 17 shows the memory divided into sections for each factor. In each section, the documents are
ranked from highest to lowest. Each factor rank is associated with the document identification and the
correlation coefficient.
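
The per-factor ranking of steps 340 and 342 can be sketched as follows. This is a hedged illustration only: it assumes the correlation is the plain sum of component-wise products of the factor vector and the document context vector, and it assumes that rank 1 denotes the highest correlation. The function names, the tie-breaking rule and the toy data are hypothetical.

```python
def correlation(factor_vec, doc_vec):
    """Sum of component-wise products of a factor vector and a document vector."""
    return sum(f * d for f, d in zip(factor_vec, doc_vec))


def rank_within_factor(factor_vec, doc_vectors):
    """Rank documents for one factor, highest correlation first (rank 1 is best)."""
    scored = [(correlation(factor_vec, vec), doc_id) for doc_id, vec in doc_vectors.items()]
    # Sort by descending correlation; ties are broken by document id (an assumption).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(rank, doc_id, corr) for rank, (corr, doc_id) in enumerate(scored, start=1)]


# Hypothetical two-dimensional toy vectors.
toy_factor = [1.0, 0.0]
toy_docs = {"d1": [0.2, 0.9], "d2": [0.8, 0.1], "d3": [0.5, 0.5]}
factor_ranking = rank_within_factor(toy_factor, toy_docs)
```

Each returned tuple holds the rank, the document identification and the correlation coefficient, mirroring the per-factor memory sections described for Figure 17.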

In step 344, the documents are ranked based on the maximum rank of the factors. Thus each document
is ranked based on the ranking in each factor by using the equation:

$$r(j) = \max_m \left( r_m(j) \right) \qquad (23)$$

where $r(j)$ is the ranking of document j; and $r_m(j)$ is the rank of document j for factor cluster m. This algorithm
corresponds to imposing a boolean constraint on the subtopics of a query.
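
A minimal sketch of the maximum-rank combination of step 344 follows. It assumes that rank 1 is the best rank within a factor, so taking the maximum over factors assigns each document its worst per-factor rank; a document must therefore rank well on every factor to rank well overall. The function name, the factor labels and the toy ranks are hypothetical.

```python
def combine_by_max_rank(ranks_per_factor):
    """Combine per-factor ranks into a final ordering.

    ranks_per_factor: dict mapping factor name -> dict mapping doc id -> rank (1 is best)
    Returns a list of (doc_id, final_rank_value) pairs, best document first.
    """
    doc_ids = list(next(iter(ranks_per_factor.values())))
    final = {doc: max(ranks[doc] for ranks in ranks_per_factor.values()) for doc in doc_ids}
    # Smaller maximum rank is better; ties are broken by document id (an assumption).
    return sorted(final.items(), key=lambda item: (item[1], item[0]))


# Hypothetical ranks of three documents under three query factors.
toy_ranks = {
    "politics": {"A": 1, "B": 2, "C": 3},
    "aircraft": {"A": 3, "B": 1, "C": 2},
    "trade": {"A": 2, "B": 1, "C": 3},
}
final_order = combine_by_max_rank(toy_ranks)
```

In this toy case document A is ranked first for one factor but last for another, so it is placed behind document B, which is never worse than second.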

The result of the maximum rank is stored in step 346 of Figure 16. The memory locations of the final ranking
are shown in Figure 18. The highest ranking document is most relevant to the query. The memory lists the rank,
the result of the maximum ranking of equation 23, and the document identification number.

Query factorization was used to retrieve relevant documents dealing with trade conflicts between the
United States and European countries on subsidies to the aircraft industry. A group average agglomerative
clustering was used to group query terms into factors based on their thesaurus vectors. Each topic was clustered
into three word factors. All directly juxtaposed words occurring at least five times in the corpus were used as
terms.

• international politics - aid, assistance, british, code, complaint, consortium, controversy, douglas,
economics, european governments, financing, french, german, government assistance, governments,
international economics, loan, objection, petition, policy review, producer, retaliation, review,
sanctions, spanish, spanish government, tension

• the aircraft industry - aeronauticas, aeronauticas s.a, aerospace, aerospace plc, aerospatiale, airbus,
airbus industrie, aircraft, aircraft consortium, blohm, boeing, boelkow, boelkow blohm, british aerospace,
construcciones, construcciones aeronauticas, douglas corp, european aircraft, gmbh, mcdonnell,
mcdonnell douglas, messerschmitt, messerschmitt boelkow, plc, s.a

• trade conflicts - airbus subsidies, anti dumping, countervailing, countervailing duty, dumping,
dumping duty, federal subsidies, gatt, general agreement, review group, subsidies, tariffs, trade dispute,
trade policy, trade tension

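
The grouping of query terms into word factors can be illustrated with a small agglomerative clustering sketch. It is an assumption-laden example, not the procedure used for the experiment: it uses cosine similarity, merges the two clusters whose members have the highest average pairwise similarity (one common reading of group average clustering), and runs on made-up three-dimensional vectors. All names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_link_clusters(vectors, num_clusters):
    """Merge clusters greedily until num_clusters remain.

    At each step the pair of clusters with the highest average cosine
    similarity between their members is merged.
    """
    clusters = [[word] for word in sorted(vectors)]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b]) for a in clusters[i] for b in clusters[j]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical thesaurus vectors for six query terms.
toy_vectors = {
    "aid": [1.0, 0.1, 0.0],
    "sanctions": [0.9, 0.2, 0.0],
    "airbus": [0.0, 1.0, 0.1],
    "boeing": [0.1, 0.9, 0.0],
    "tariffs": [0.0, 0.1, 1.0],
    "dumping": [0.1, 0.0, 0.9],
}
word_factors = average_link_clusters(toy_vectors, 3)
```

On this toy input the three resulting clusters pair up the politics, aircraft and trade terms, analogous to the three word factors listed above.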
One outstanding problem of similarity search is that it treats search terms as if they were in one large
disjunction. By scoring each factor separately and recombining the scores appropriately, only documents that
score highly on all factors are ranked highly, which introduces a conjunctive constraint.

For example, a document may score high for a query as a whole although it deals with only one of the
subtopics of the query. Many high-scoring documents are about the aircraft industry without mentioning trade
conflicts or international politics. Instead of evaluating the query as a whole, each subtopic should be evaluated
individually and the results combined. If a document is irrelevant to one of the important subtopics of the query,
then it often is irrelevant as a whole. For example, a document on the aircraft industry without any mention of
trade is irrelevant.

The second goal of word factorization is to eliminate irrelevant words semi-automatically. Many words in
the Tipster topic descriptions are not relevant for the query in question, but they should not be placed on a
stop list either because they could be relevant for other queries. For example, topic description 75 is about
failed or successful automation. The topic is identified as belonging to the general area of "Science and
Technology". Therefore, "science" is one of the terms of the query. However, it is not relevant for the query. One
of the word factors of topic 75 is the following:

• failed; instance; force; conversely; science

This word factor does not contain good search terms and was therefore not used in retrieval. The decision
whether a word factor was relevant or not was made manually. The word factors that were judged relevant were then
combined according to the algorithm described above.

Referring to Figure 19, a linear combination of tf.idf and context vectors to evaluate document rank with
respect to each factor proved superior to using either method on its own (average precision 0.308 for ranking
based only on tf.idf, 0.267 for a ranking based only on context vectors). The top line in Figure 19 shows the
precision for 11 recall points for α = 0.66. Average precision is 0.3218. This is a five percent improvement over
the tf.idf result of 0.271.
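
The text does not spell out the form of the linear combination, so the sketch below is only one plausible reading. It assumes that the per-factor tf.idf rank and context-vector rank of a document are mixed as alpha times the tf.idf rank plus (1 - alpha) times the context-vector rank, with the weight 0.66 placed on the tf.idf side; which method receives the weight, the function names and the toy ranks are all assumptions.

```python
def combined_rank_score(tfidf_rank, context_rank, alpha=0.66):
    """Hypothetical mix: alpha weights the tf.idf rank, (1 - alpha) the context-vector rank."""
    return alpha * tfidf_rank + (1.0 - alpha) * context_rank


def rerank(tfidf_ranks, context_ranks, alpha=0.66):
    """Order documents by the combined score; a lower score means a better rank."""
    scores = {
        doc: combined_rank_score(tfidf_ranks[doc], context_ranks[doc], alpha)
        for doc in tfidf_ranks
    }
    return sorted(scores, key=lambda doc: (scores[doc], doc))


# Hypothetical ranks of three documents for one factor under each method.
toy_tfidf_ranks = {"d1": 1, "d2": 2, "d3": 3}
toy_context_ranks = {"d1": 3, "d2": 1, "d3": 2}
combined_order = rerank(toy_tfidf_ranks, toy_context_ranks)
```

The resulting per-factor order would then feed the maximum-rank combination described earlier.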
APPENDIX A

The ten highest ranking documents for the query average are:

- bank fund and cd yields are lower for fifth week
- money fund and cd yields drop for 4th straight week
- bank yields on certificates of deposit fall narrowly
- cd and bank yields mixed during week
- adds 1973 data; drops repeated figures for 1974<6.
- cd and bank fund yields continue to decline
- bank funds, cd's mixed for week
- cd yields off a 10th week; savings units' rates mixed
- yields on cds and money market accounts fall for 9th week
- yields on cds and money market accounts fall for 9th week

The ten highest ranking documents for the query Eschenbach are:

- the recorded legacy of leonard bernstein
- music world pays tribute to leonard bernstein
- mostly mozart to open with one-act opera
- maurice gendron, french cellist and conductor, dies at 69
- san francisco plans festival for mozart's bicentenary
- daniel guilet, violinist, is dead; beaux arts trio founder was 91
- hollywood bowl will create second orchestra
- bernstein retires, citing poor health
- some riffs on a dream jazz festival
- richard lewis, tenor, dies at 76; sang baroque and modern works

The ten highest ranking documents for the query gill are:

- vandal vessels invade the atlantic
- summer anglers take note
- driven from pacific, asian driftnet fishermen now ply atlantic
- red snapper shortage possible within 5 years
- for release tuesday, sept. 18
- japanese drift-net fishers face protest from conservationists
- the keys to success for the saltwater fly fisherman
- dried arrangements
- outdoors: atlantic bonito are swift and strong
- battle to preserve florida reef intensifies

The ten highest ranking documents for the query Indonesians are:

- vietnamese pay a high-level visit to china
- khmer rouge moves thousands of refugees into malarial 'liberated
- conflict between u.s. and asia over boat people intensifies
- asian nations criticize u.s. refugee policy
- boat people's plight seen as likely to worsen
- beijing and singapore agree to set up diplomatic links
- beijing on a diplomatic roll
- former dissident in mainland china now facing charges in taiwan
- rivals shun conference
- end to cambodian killing fields could begin in hanoi this weekend

The ten highest ranking documents for the query markup are:

- supermarkets now sell the space on their shelves
- lucky stores to offer home grocery delivery
- now fewer firms are chasing small investors
- retailers adapt sales strategies to uncertainties of holiday season
- detroit responds to rental-car challenge
- liz claiborne sets discount increase
- wal-mart set to buy discount club chain
- hypermarkets not generating whopping sales
- upgraded image helps penney's bring in shoppers
- g.m. sets incentive offer on many of its 1991 models

The ten highest ranking documents for the query Novell are:

- tandy introducing home computer
- hitachi plans hewlett chip
- amdahl posts 41 percent profit gain
- ashton-tate sells software product
- unix pact disclosed
- apple extends adobe license
- mips forecasts possible deficit
- new sun printer and software
- consortium's new software
- ask computer acquires ingres corp.

The ten highest ranking documents for the query pleaded are:

- federal judge aguilar sentenced to six months prison
- ex-juror in goland election-fraud trial accused of offering to rig
- former teamster leader guilty of conspiracy with federal judge
- the various charges in the marcos trial
- jury to be asked about camarena news articles
- federal judge is given reduced jail term in corruption case
- plea bargain falls through for drug defendant
- victim's family rejects bensonhurst deal
- los angeles: evidence in the case.
- barry's lawyers, federal prosecutors begin talks on plea bargain

The ten highest ranking documents for the query rags are:

- the three-handkerchief men
- unseen traveling companions
- under the sun and out of it, too
- sparkle for the ears
- commentary: marble vases, gloria steinem and mary kay
- rockers turn their garb into garble
- now that fall's here, it's time for haute couture
- printed overalls, for when paris sizzles
- 'gone with the wind' mural continues artist's heritage
- what's haute is haute

The ten highest ranking documents for the query shriveled are:

- lustau reserva light fino sherry, solera jarana; about $8.
- on the politics of green-bean cookery
- slow-cooked seasonal fare: a reason for fast-paced cooks to stew?
- undated: communal baths.
- summer's still warm at greenmarkets
- there's yummy, and there's yucky yogurt
- a pennsylvania tradition: the soft pretzel
- those great old decadent delicious diners!
- the ethnic varieties of the urban pickle
- when whatever goes in the pot is whatever's in the pantry

The ten highest ranking documents for the query Societies are:

- pope issues key document on roman catholic higher education
- churches, s.f. mayor begin new push to enlist aids volunteers
- leader of eastern orthodox christians begins visit to u.s.
- leader of eastern orthodox christians begins visit to u.s.
- interfaith effort aims to ease children's suffering
- interfaith effort aims to ease children's suffering
- blahoslav s. hruby, presbyterian minister, dies at 78
- rabbi kelman, leader of conservative judaism, dies at 66
- greek orthodox group wants to permit married bishops
- vatican, jewish groups see need to fight anti-semitism in east

TABLE 1



word           nearest neighbors
average        averages lowest ratio percent percentage averaging per
Eschenbach     orchestra Muti Orchestra orchestras Riccardo symphony
gill           nets fishermen fishing fishery fisheries fisherman fished
Indonesians    Indonesian Indonesia Timorese Thais Kong Hong Malaysian
markup         markups sells customers sell buying buys customer salespeople
Novell         ncr Com Manzi Kodak ibm Cowen Quist Hambrecht Lotus
pleaded        guilty plead felony plea pleads conspiring felonies conspiracy
rags           hat neatly neat beside hangs lit tall rag fake gaze eyed cowboy
shriveled      lean thin spiked soft plain dips into bursting bit appetites
Societies      societies Society social fraternal society auspices fostering

TABLE 2

retrieval vector    nearest neighbors
tank + artillery    tanks tank artillery armor howitzers armored ammunition
tank + water        pipes water tank flush ocean reservoir shore underwater
coast               Coast coastline coastal Ocean shore ashore inland sea coasts
expensive           inexpensively expense prohibitively cheaper dispensing costlier extensions costly costing

TABLE 3

article    title
132        experimental vaccine protects chimps against aids infection
13609      kathleen brown wants to be governor, not treasurer, hayes aide says
14387      israeli intelligence says palestinians planning attacks against
22872      corning, n.y.: and engineers (corning's affirmative action program)
27081      income-tax package may not be as unbalanced as it seems
4579       a low-fat menu for holiday outings

TABLE 4

article    nearest neighbors
132        causes injecting defect onset dose injected cure cause commonly prolong
13609      has part reputation been for could also made no Mr
14387      calling termed official American refrain swiftly provoked persuade besieged disrupt
22872      also part other has many more others from an most
27081      would thereby coupled also gap under bulk new than sum
4579       eggs mix bean soda mixing nuts egg jar slice beef




Claims 





1. A method for generating a thesaurus of word vectors based on lexical co-occurrence of words within
documents of a corpus of documents, said method including:

retrieving a word from the corpus;

recording a number of times the word co-occurs with a co-occurring word located within a prede- 
termined range, the co-occurring word being any word located before and after the word within the pre- 
determined range; 

repeating the recording step for every co-occurring word located within the predetermined range; 

and 

generating a word vector for the word based on every recorded number. 

2. A method as claimed in claim 1, wherein a matrix of word vectors is formed using all of the generated
word vectors of the corpus.

3. A method as claimed in claim 2, wherein forming the matrix includes:

forming a first matrix from a first subset of words within the corpus, each element of the first matrix
recording the number of times that two words within the first subset co-occur in the predetermined range;

clustering the first matrix into groups to form a set of low coverage word clusters;

forming a second matrix from the low coverage word clusters and a second subset of words within
the corpus, the second subset containing more words than the first subset, each element of the second
matrix recording a number of times that each word of the second subset co-occurs with each cluster of
the set of low coverage word clusters within the predetermined range;

clustering the second matrix into groups to form a set of high coverage word clusters;

forming a third matrix from the high coverage word clusters and all of the words of the corpus, each
element of the third matrix recording a number of times that each word of the corpus co-occurs with each
cluster of the set of high coverage word clusters within the predetermined range; and

reducing dimensionality of the third matrix to represent each element of the third matrix as a
compact vector and, optionally,
before forming the first matrix, the steps of:

computing word frequency and word pair frequency of the words of the corpus; and

removing words and word pairs with low frequency.

4. A method of determining relevant documents in a corpus of documents based on a query, the method
including:

generating a thesaurus of word vectors based on co-occurrence of words within each of the
documents of the corpus of documents;

generating a context vector for each document, the document context vector based on the word
vectors from the thesaurus for each word located in the document;

generating a context vector for the query based on the word vectors from the thesaurus for each
word in the query;

computing correlation coefficient for each document based on the query context vector and the
document context vector;

ranking each document based on the computed correlation coefficients; and

outputting the ranking of each document.

5. A method of determining relevant documents in a corpus of documents based on a query, the method
including:

generating a thesaurus of word vectors based on the co-occurrence of words within each of the
documents of the corpus of documents;

generating a factor vector based on a clustering of word vectors of the query;

generating a context vector for each document, the document context vector based on the
thesaurus word vectors for each word located in the document;

computing correlation coefficients for each document based on the factor vector and the document
context vector;

ranking each document within a factor cluster based on the computed correlation coefficients;

computing maximum rank of each document based on a combination of the ranks of the document
in each factor cluster; and

outputting the final rank of each document.

6. A method as claimed in claim 5, wherein the generating of the factor vectors comprises the steps of:

retrieving the thesaurus word vectors for each word in the query;

performing a clustering algorithm to form factor clusters of the query;

generating a factor vector for each factor cluster.

7. An apparatus for generating a thesaurus of word vectors for each word in a corpus of documents, the
word vectors being based on the lexical co-occurrence of words within each of the documents, comprising:

a memory containing the corpus of documents;

an extractor for retrieving a word from the corpus;

a counter recording the number of times the word co-occurs with a co-occurring word located within
a predetermined range, the co-occurring word being any word located before and after the word within
the predetermined range, the counter recording the number for every co-occurring word within the
predetermined range;

a generator generating a word vector for the word based on every recorded number; and

an output for outputting the word vectors in the thesaurus.

8. An apparatus for retrieving relevant documents from a corpus of documents based on a query,
comprising:

a memory containing the corpus of documents;

a thesaurus of word vectors for each word of the corpus, the word vectors being based on lexical
co-occurrence of words within each of the documents;

a processor for generating document vectors and a query vector based on the word vectors, each
document vector being a summation of the word vectors that are associated with the words located within
the document; the query vector being a summation of the word vectors that are associated with the words
located within the query;

determining means for determining a co-occurrence correlation relationship between the document
vectors and the query vector; and

output means for outputting the relevant documents based on the correlation relationship
determined by the determining means.

9. An apparatus as claimed in claim 8, wherein the query vector is a plurality of query factor vectors, each
query factor vector based on a cluster factor of the query.

10. An apparatus as claimed in claim 9, wherein the determining means ranks the documents in factor clusters
based on the correlation relationship between the document vectors and the query factor vectors of the
query vector.

WordSpace



CURRENT QUERY
tank artillery

SEARCH RESULT
100. tanks 101. tank 102. artillery
103. armor 104. howitzers 105. armored
106. ammunition 107. bombardment 108. battalions
109. howitzer

POOL
0. tank 1. artillery

MENU
- <N> DELETE word <N> from pool
n retrieve near NEIGHBORS of pool

DocumentSpace

CURRENT QUERY
Japanese American research

SEARCH RESULT
100. miti plans basic research center open to foreign firms
101. new economic institute seeks to aid u.s. companies in global
102. can japan make einsteins too?
103. nations forge plan to share technologies
104. joint u.s.-japan poll finds growing perception gap on trade issue
105. miti forms environmental protection research institute